Introduction

The goals of this exercise are to: - Introduce you to bayesian analysis using revbayes

This file will not knit until all the results from RevBayes are in a folder with it. You can click Run > Run All to run the code up to what you have completed as you step through the exercise.

RevBayes is a very flexible tool for phylogenetic analyses. You have extensive control over which features of your model are constant, deterministic, or stochastic. This allows RevBayes to do a standard Bayesian phylogenetic inference, but it can also do much more including time calibration, ancestral character state reconstruction, and phylogenetic comparative analyses.

This flexibility comes with greater complexity than many other tools. For example, the models are explicitly specified rather than just called by name (eg GTR). The configurations are placed in RevBayes script files that typically end with the file extension .Rev. These files contain a series of commands written in RevBayes’ own language, which has a syntax similar to R. For an introduction to how to specify models and implement analyses see the documentation on Getting Started with RevBayes tutorial. For an overview of all the commands see the documentation.

A typical RevBayes script has commands that load the data, set up the model, define the analysis, run the analysis, and control what is in the output and where it goes. For routine analyses you can modify the basics of existing files, changing the input and output files for example. For more more specialized analyses you can extensively revise and expand these files, or write them from scrap.

There is a series of detailed tutorials that serve as a great starting point for using RevBayes. This exercise is based off of the Nucleotide substitution models tutorial.

Software

RevBayes can be difficult to install because it requires an external library, boost. I suggest using it on a remote cluster.

We will also use two desktop tools to view the results. Both require java, which you will need to install if you don’t have it already.

Install these desktop tools on your computer:

Exercise files

This exercise folder contains:

Following the tutorial

This exercise is a bit different than the others in that you are fallowing another existing tutorial – Nucleotide substitution models. The problems listed here related to that tutorial and the steps you take there.

That tutorial is written as if you are running the commands on your own computer. We will run it on the cluster. This requires some deviations from the instructions in the tutorial. Rather than launch the scripts by opening RevBayes and running a source() command, you will use the job*.sh slurm scripts to launch revbayes and run the .Rev scripts. To run the JC analysis, for example, lanunch the job on the cluster with:

sbatch job_jc.sh

In the interest of time, we will skip the HKY model. Feel free to go through it if you like.

Summarizing MCMC Samples

To understand the MCMC sampling, plot some key variables recorded in the log file through generations. Download the output file from the cluster and place it on your computer in the same folder as this file.

You can view the log files with Tracer, or with the code below.

We will plot the summary MCMC stats right here in R. Each replicate is shown as a different color.

trace_jc = read_tsv("revbayes_output/output_jc/primates_cytb_JC.log")
## Rows: 40004 Columns: 49
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (49): Iteration, Replicate_ID, Posterior, Likelihood, Prior, bl[1], bl[2...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trace_jc %<>%                           # Read the data
  mutate(Replicate_ID = as.factor(Replicate_ID))                       # Convert the replicate id to a factor
  trace_jc %<>% mutate(Iteration = Iteration %/% length(levels(trace_jc$Replicate_ID)))  # The iterations are sequenctial across all replicates. Rescale them so that they are in number of generations per replicate, rather than total number of samples

Now let’s plot Tree length, TL, as in the tutorial:

trace_jc %>%
  ggplot(aes(x=Iteration, y=TL, col=Replicate_ID), alpha=0.2) + geom_line()

trace_jc %>%
  ggplot(aes(x=Iteration, y=Likelihood, col=Replicate_ID), alpha=0.2) + geom_line()

These plots show all iteration, ie generations. These include the first MCMC iterations before the run has burned in. By default, the readTreeTrace that is used in our scripts to import trees to summarize them discards the phylogenies from the first 25% of iterations as a burn in. Here we will do the same, and replot:

trace_jc %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=TL, col=Replicate_ID), alpha=0.2) + geom_line()

trace_jc %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=Likelihood, col=Replicate_ID), alpha=0.2) + geom_line()

Now we will also look at the post-burning distribution of TL and Likelihood

trace_jc %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(TL, fill = Replicate_ID, colour = Replicate_ID)) +
    geom_density(alpha = 0.1)

trace_jc %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(Likelihood, fill = Replicate_ID, colour = Replicate_ID)) +
    geom_density(alpha = 0.1)

Problem 1: (2 points) Answer the following questions about the MCMC summaries for the JC model above.

Open the tree output_jc/primates_cytb_JC_MAP.tree on your local computer with FigTree to inspect the topology, edge length, and posterior probabilities. You will need to check the Node Labels box and then in Display: select posterior.

Problem 2: (1.5 points) Answer the following questions for the GTR analyses. Copy and paste the codeblocks from above and modify them to import these results and plot them, or use Tracer to view the results. Pick at least one other model paraemter to examine in addition to TL and Likelihood. Note that this analysis has just a single replicate.

trace_gtr = read_tsv("revbayes_output/output_gtr/primates_cytb_GTR.log")
## Rows: 1001 Columns: 58
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (58): Iteration, Posterior, Likelihood, Prior, bl[1], bl[2], bl[3], bl[4...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trace_gtr %>%
  ggplot(aes(x=Iteration, y=TL), alpha=0.2) + geom_line()

trace_gtr %>%
  ggplot(aes(x=Iteration, y=Likelihood), alpha=0.2) + geom_line()

trace_gtr %>%
  ggplot(aes(x=Iteration, y=Prior), alpha=0.2) + geom_line()

trace_gtr %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=TL), alpha=0.2) + geom_line()

trace_gtr %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=Likelihood), alpha=0.2) + geom_line()

trace_gtr %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=Prior), alpha=0.2) + geom_line()

trace_gtr %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(TL)) +
    geom_density(alpha = 0.1)

trace_gtr %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(Likelihood)) +
    geom_density(alpha = 0.1)

trace_gtr %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(Prior)) +
    geom_density(alpha = 0.1)

Problem 3: (1.5 points) Answer the following questions for the GTR+IG analyses. Copy and paste the codeblocks from above and modify them to import these results and plot them, or use Tracer to view the results. Examine alpha in addition to TL and likelihood. Note that this analysis has just a single replicate.

trace_gtrig = read_tsv("revbayes_output/output_gtrig/primates_cytb_GTRGI.log")
## Rows: 1001 Columns: 64
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (64): Iteration, Posterior, Likelihood, Prior, alpha, bl[1], bl[2], bl[3...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trace_gtrig %>%
  ggplot(aes(x=Iteration, y=TL), alpha=0.2) + geom_line()

trace_gtrig %>%
  ggplot(aes(x=Iteration, y=Likelihood), alpha=0.2) + geom_line()

trace_gtrig %>%
  ggplot(aes(x=Iteration, y=alpha), alpha=0.2) + geom_line()

trace_gtrig %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=TL), alpha=0.2) + geom_line()

trace_gtrig %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=Likelihood), alpha=0.2) + geom_line()

trace_gtrig %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(x=Iteration, y=alpha), alpha=0.2) + geom_line()

trace_gtrig %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(TL)) +
    geom_density(alpha = 0.1)

trace_gtrig %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(Likelihood)) +
    geom_density(alpha = 0.1)

trace_gtrig %>%
  filter( Iteration > 2500 ) %>%
  ggplot(aes(alpha)) +
    geom_density(alpha = 0.1)

Submit

Make sure you have all files in this folder and have knit the document. Compress the folder with zip, and submit it on canvas.